A Bayesian approach for supervised discretization
Author
Abstract
In supervised machine learning, some algorithms are restricted to discrete data and thus need to discretize continuous attributes. In this paper, we present a new discretization method called MODL, based on a Bayesian approach. The MODL method relies on a model space of discretizations and on a prior distribution defined on this model space. This allows the definition of an evaluation criterion for discretizations, which is minimal for the most probable discretization given the data, i.e. the Bayes-optimal discretization. We compare this approach with the MDL approach and with statistical approaches used in other discretization methods, from both a theoretical and an experimental point of view. Extensive experiments show that the MODL method builds high-quality discretizations.
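To make the flavor of such a criterion concrete, the Python sketch below scores one candidate discretization of a numeric attribute from its per-interval class counts. It assumes a hierarchical uniform prior (uniform choice of the number of intervals, of the interval sizes given that number, and of the class distribution within each interval), which matches the general form associated with MODL-style criteria; the function and variable names are illustrative, and the code should be read as a sketch rather than as the paper's exact formula.

```python
from math import lgamma, log

def log_binomial(n, k):
    """log C(n, k), computed via lgamma for numerical stability."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def discretization_cost(interval_class_counts):
    """
    Negative log of prior x likelihood for one candidate discretization,
    under an assumed hierarchical uniform prior:
      - uniform choice of the number of intervals I among 1..N,
      - uniform choice of the interval sizes given I,
      - uniform choice of the class distribution within each interval.
    interval_class_counts: one list of class counts per interval,
    e.g. [[10, 2], [1, 12]] means 2 intervals and 2 classes.
    A lower cost corresponds to a more probable discretization given the data.
    """
    I = len(interval_class_counts)
    J = len(interval_class_counts[0])
    N = sum(sum(counts) for counts in interval_class_counts)

    cost = log(N)                                 # choice of the number of intervals
    cost += log_binomial(N + I - 1, I - 1)        # choice of the interval sizes
    for counts in interval_class_counts:
        n_i = sum(counts)
        cost += log_binomial(n_i + J - 1, J - 1)  # class-distribution prior in interval i
        # likelihood term: multinomial coefficient n_i! / (n_i1! ... n_iJ!)
        cost += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in counts)
    return cost

# A 2-interval split that separates the classes well scores lower (better)
# than merging all instances into a single interval.
split = [[18, 2], [3, 17]]
merged = [[21, 19]]
print(discretization_cost(split), discretization_cost(merged))
```

Under such a criterion, searching the model space for the partition of minimal cost yields the most probable discretization given the data.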
Similar articles
Discretizing Continuous Attributes While Learning Bayesian Networks
We introduce a method for learning Bayesian networks that handles the discretization of continuous variables as an integral part of the learning process. The main ingredient in this method is a new metric based on the Minimal Description Length principle for choosing the threshold values for the discretization while learning the Bayesian network structure. This score balances the complexity of ...
Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach
In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variabl...
Optimal Bayesian 2D-Discretization for Variable Ranking in Regression
In supervised machine learning, variable ranking aims at sorting the input variables according to their relevance w.r.t. an output variable. In this paper, we propose a new relevance criterion for variable ranking in a regression problem with a large number of variables. This criterion comes from a discretization of both input and output variables, derived as an extension of a Bayesian non para...
A Bayesian Discretizer for Real-Valued Attributes
Discretization of real-valued attributes into nominal intervals has been an important area for symbolic induction systems, because many real-world classification tasks involve both symbolic and numerical attributes. Among the various supervised and unsupervised discretization methods, the information gain based methods have been widely used and cited. This paper designs a new discretization method, c...
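For readers unfamiliar with the information-gain family mentioned in this blurb, the sketch below shows its core step: choosing the cut point that most reduces class entropy on the sorted values. It is an illustrative implementation, not the method proposed in that paper; real entropy-based discretizers (e.g. Fayyad and Irani's) add a stopping rule, which is omitted here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """
    Return the threshold that maximizes information gain, i.e. the reduction
    in class entropy obtained by splitting the sorted data at that point.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_gain

# The cut falls between the two class clusters, with a gain of 1 bit.
print(best_cut([1.0, 1.5, 2.0, 5.0, 5.5, 6.0], ["a", "a", "a", "b", "b", "b"]))
```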
A New Hybrid Framework for Filter based Feature Selection using Information Gain and Symmetric Uncertainty (TECHNICAL NOTE)
Feature selection is a pre-processing technique for eliminating irrelevant and redundant features, which enhances the performance of classifiers. When a dataset contains many irrelevant and redundant features, accuracy does not improve and classifier performance degrades. To avoid this, this paper presents a new hybrid feature selection method usi...
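As a rough illustration of the symmetric-uncertainty score named in this title, the sketch below computes SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)) for two discrete variables; it is a generic textbook formulation, not the hybrid framework of that paper, and it assumes continuous features have already been discretized.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    """
    SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), with the information gain
    (mutual information) computed as H(X) + H(Y) - H(X, Y).
    Values lie in [0, 1]: 0 for independence, 1 when each variable
    fully determines the other.
    """
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    ig = hx + hy - hxy
    denom = hx + hy
    return 0.0 if denom == 0 else 2 * ig / denom

# A feature identical to the class gets SU = 1; an uninformative one scores near 0.
cls = ["p", "p", "n", "n", "p", "n"]
print(symmetric_uncertainty(cls, cls))             # 1.0
print(symmetric_uncertainty(["a", "b"] * 3, cls))  # close to 0
```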
Publication year: 2004